AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
import pandas as pd # Imports the pandas library for data manipulation and analysis
import numpy as np # Imports the NumPy package for advanced math operations
import seaborn as sns # Imports the seaborn Python data visualization library (popular & nicer charts)
sns.set(color_codes=True) # Adds a nice background to the graphs
import warnings # Imports the warnings (step 1 of 2 to ignore them)
warnings.filterwarnings("ignore") # This prevents the warnings from showing (step 2 of 2 to ignore them)
import matplotlib.pyplot as plt # Imports Matplotlib Python's sub-library for standard charts
# The line below instructs Jupyter to show charts inline within the notebook
%matplotlib inline
import statsmodels.api as sm # explores data, performs statistical tests and estimates statistical models
from sklearn.model_selection import train_test_split # Sklearn package's randomized data splitting function
from sklearn.linear_model import LogisticRegression # statistical model. Finds the relationship between a dependent variable (Y) with an independent variable or set of variables (X)
from sklearn.metrics import mean_squared_error # Mean squared error of the predictions, used to gauge how well the model fits
pd.set_option('display.max_columns', None) #Removes the limit from the number of displayed columns so the entire width of the data frame is displayed.
pd.set_option('display.max_rows', 200) #Extends the number of displayed rows to 200 so we can review more data. Much more than this could slow down the notebook.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # Regression metrics (MAE, MSE, R²) to evaluate model fit
from statsmodels.stats.outliers_influence import variance_inflation_factor # Computes VIF to check for multicollinearity
from sklearn import metrics # Builds the confusion matrix
from sklearn.metrics import roc_auc_score # AUC Curve score
from sklearn.metrics import roc_curve # AUC curve plotting
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn import tree # Decision tree visualizer
from sklearn.model_selection import GridSearchCV # Grid search. Tuning for the tree model
from sklearn.metrics import accuracy_score # Calculates the accuracy of the model
from statsmodels.tools.tools import add_constant # Adds an intercept column, needed for the VIF calculation and statsmodels models
# Reads the file. No need to indicate a directory if the file is in the same location as the Jupyter notebook.
# The first column (ID) is used as the index since it carries no predictive information
file_data = pd.read_csv('Loan_Modelling.csv',index_col=0)
print(f'There are {file_data.shape[0]} rows and {file_data.shape[1]} columns in the dataset') # f-string
There are 5000 rows and 13 columns in the dataset
data = file_data.copy() # Best practice to preserve the integrity of the raw data
np.random.seed(1) # Fixes the pseudo-random seed so we get the same sample every time
# A random sample gives us a better understanding of the overall data formatting
data.sample(n=10)
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||
| 2765 | 31 | 5 | 84 | 91320 | 1 | 2.9 | 3 | 105 | 0 | 0 | 0 | 0 | 1 |
| 4768 | 35 | 9 | 45 | 90639 | 3 | 0.9 | 1 | 101 | 0 | 1 | 0 | 0 | 0 |
| 3815 | 34 | 9 | 35 | 94304 | 3 | 1.3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3500 | 49 | 23 | 114 | 94550 | 1 | 0.3 | 1 | 286 | 0 | 0 | 0 | 1 | 0 |
| 2736 | 36 | 12 | 70 | 92131 | 3 | 2.6 | 2 | 165 | 0 | 0 | 0 | 1 | 0 |
| 3923 | 31 | 4 | 20 | 95616 | 4 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2702 | 50 | 26 | 55 | 94305 | 1 | 1.6 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1180 | 36 | 11 | 98 | 90291 | 3 | 1.2 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |
| 933 | 51 | 27 | 112 | 94720 | 3 | 1.8 | 2 | 0 | 0 | 1 | 1 | 1 | 1 |
| 793 | 41 | 16 | 98 | 93117 | 1 | 4.0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
data.head() #Prints the first 5 rows
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||
| 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail() #Prints the last 5 rows
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||
| 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.dtypes.value_counts() # This dataset has 12 integer columns and 1 float column
int64      12
float64     1
dtype: int64
data.info() # Details on the 13 columns, their respective data types, and the presence of any null values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 1 to 5000
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   ZIPCode             5000 non-null   int64
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64
 7   Mortgage            5000 non-null   int64
 8   Personal_Loan       5000 non-null   int64
 9   Securities_Account  5000 non-null   int64
 10  CD_Account          5000 non-null   int64
 11  Online              5000 non-null   int64
 12  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(12)
memory usage: 546.9 KB
def null_values(x): # A user-defined function to check for null values
    null_values = x.isnull().sum().sort_values(ascending=False) # Total null values per column, sorted in descending order
    percent_null = (x.isnull().sum()/x.isnull().count()).sort_values(ascending=False) # % of null values per column
    NaN_data = pd.concat([null_values, percent_null], axis=1, keys=["Total null values", "% of null values"]) # Concatenates the two series side by side (axis=1 creates columns)
    return NaN_data # Returns the summary dataframe
null_values(data)
| Total null values | % of null values | |
|---|---|---|
| CreditCard | 0 | 0.0 |
| Online | 0 | 0.0 |
| CD_Account | 0 | 0.0 |
| Securities_Account | 0 | 0.0 |
| Personal_Loan | 0 | 0.0 |
| Mortgage | 0 | 0.0 |
| Education | 0 | 0.0 |
| CCAvg | 0 | 0.0 |
| Family | 0 | 0.0 |
| ZIPCode | 0 | 0.0 |
| Income | 0 | 0.0 |
| Experience | 0 | 0.0 |
| Age | 0 | 0.0 |
duplicates=data.duplicated() #checks for duplicates in the data
sum(duplicates)
0
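Since `duplicated()` returned zero here, no rows need dropping. As a minimal illustration of how this check behaves (the toy frame and column names below are made up):

```python
import pandas as pd

# Hypothetical toy frame: the second row repeats the first exactly
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
n_dupes = toy.duplicated().sum()  # counts rows identical to an earlier row
deduped = toy.drop_duplicates()   # keeps the first occurrence of each row
```

Had the sum been non-zero, `drop_duplicates()` would be the usual remedy.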
data.describe(include="all").T # Summary statistics for every column
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.0 | 93437.0 | 94608.0 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
def histogram_boxplot(feature, figsize=(15,10), bins=None): # Function to automatically generate charts when called
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2, # Number of rows of the subplot grid = 2
                                           sharex=True, # x-axis will be shared among all subplots
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize
                                           ) # creating the 2 subplots
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='red') # boxplot; a marker indicates the mean value of the column
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins) if bins else sns.distplot(feature, kde=False, ax=ax_hist2) # histogram
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--') # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-') # Add median to the histogram
histogram_boxplot(data["Age"]);
histogram_boxplot(data["Experience"]);
histogram_boxplot(data["Income"]);
histogram_boxplot(data["Family"]);
histogram_boxplot(data["CCAvg"]);
histogram_boxplot(data["Education"]);
histogram_boxplot(data["Mortgage"]);
plt.figure(figsize=(15,13))
sns.scatterplot(y='Income', x='Age', hue='Personal_Loan', data=data);
plt.figure(figsize=(15,13))
sns.scatterplot(y='Income', x='Education', hue='Personal_Loan', data=data);
plt.figure(figsize=(15,13))
sns.scatterplot(y='Income', x='Mortgage', hue='Personal_Loan', data=data);
plt.figure(figsize=(15,13))
sns.scatterplot(y='Income', x='Family', hue='Personal_Loan', data=data);
tab1 = pd.crosstab(data.Personal_Loan,data.CD_Account,margins=True)
tab = pd.crosstab(data.Personal_Loan,data.CD_Account)
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left",bbox_to_anchor=(1,1),title="CD Account 0=No and 1=Yes");
tab1 = pd.crosstab(data.Personal_Loan,data.Securities_Account,margins=True)
tab = pd.crosstab(data.Personal_Loan,data.Securities_Account)
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1),title="Securities Account 0=No and 1=Yes");
plt.figure(figsize=(20,30))
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(5,4,i+1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Let us look at the quantiles of Income
data.Income.quantile([.1,.2,.3,.4,.5,.6,.7,.8,.9,.95,.98,.99,1])
0.10     22.0
0.20     33.0
0.30     42.0
0.40     52.0
0.50     64.0
0.60     78.0
0.70     88.3
0.80    113.0
0.90    145.0
0.95    170.0
0.98    185.0
0.99    193.0
1.00    224.0
Name: Income, dtype: float64
q=data["Income"].quantile(0.99) # Cutoff for removing the top 1% of Income values
data1=data[data["Income"]<q]
# Let us look at the quantiles of CCAvg
data1.CCAvg.quantile([.1,.2,.3,.4,.5,.6,.7,.8,.9,.95,.98,.99,1])
0.10    0.3
0.20    0.5
0.30    0.8
0.40    1.1
0.50    1.5
0.60    1.9
0.70    2.2
0.80    2.8
0.90    4.1
0.95    5.7
0.98    7.2
0.99    7.8
1.00    9.3
Name: CCAvg, dtype: float64
q=data1["CCAvg"].quantile(0.99) # Cutoff for removing the top 1% of CCAvg values
data2=data1[data1["CCAvg"]<q]
# Let us look at the quantiles of Mortgage
data2.Mortgage.quantile([.1,.2,.3,.4,.5,.6,.7,.8,.9,.95,.98,.99,1])
0.10      0.00
0.20      0.00
0.30      0.00
0.40      0.00
0.50      0.00
0.60      0.00
0.70     78.00
0.80    121.00
0.90    196.00
0.95    263.30
0.98    353.12
0.99    415.06
1.00    612.00
Name: Mortgage, dtype: float64
q=data2["Mortgage"].quantile(0.99) # Cutoff for removing the top 1% of Mortgage values
data3=data2[data2["Mortgage"]<q]
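The three trims in this section all follow the same pattern: compute the 99th percentile of a column and keep rows below it. A reusable sketch of that pattern; the helper name and toy data are illustrative, not part of the notebook:

```python
import pandas as pd

def trim_upper_quantile(df, column, q=0.99):
    """Drop rows at or above the q-th quantile of `column` (hypothetical helper)."""
    cutoff = df[column].quantile(q)
    return df[df[column] < cutoff]

# Toy example: values 1..100; the 0.99 quantile is ~99.01, so only the row with 100 is dropped
toy = pd.DataFrame({"v": range(1, 101)})
trimmed = trim_upper_quantile(toy, "v")
```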
plt.figure(figsize=(20,30))
numeric_columns = data3.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(5,4,i+1)
    plt.boxplot(data3[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
data3.drop(["ZIPCode","Online"],axis=1,inplace=True) # ZIPCode and Online are excluded from this analysis
data3.drop(data3[data3['Experience'] < 0].index, inplace=True) # Removing rows with negative Experience values
data_cleaned=data3.reset_index(drop=True)
data_cleaned.describe(include="all").T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 4843.0 | 45.624200 | 11.343825 | 24.0 | 36.0 | 46.0 | 55.0 | 67.0 |
| Experience | 4843.0 | 20.393764 | 11.338581 | 0.0 | 11.0 | 20.0 | 30.0 | 43.0 |
| Income | 4843.0 | 71.450547 | 43.584931 | 8.0 | 38.0 | 63.0 | 93.0 | 192.0 |
| Family | 4843.0 | 2.401198 | 1.152326 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 4843.0 | 1.828827 | 1.571200 | 0.0 | 0.7 | 1.5 | 2.5 | 7.6 |
| Education | 4843.0 | 1.888705 | 0.838831 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 4843.0 | 55.355152 | 98.624209 | 0.0 | 0.0 | 0.0 | 101.0 | 612.0 |
| Personal_Loan | 4843.0 | 0.092505 | 0.289767 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 4843.0 | 0.104274 | 0.305647 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 4843.0 | 0.060087 | 0.237672 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CreditCard | 4843.0 | 0.294446 | 0.455840 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
data_cleaned["Income_Per_Month"] = data_cleaned["Income"]/12 # Annual income divided by 12 to get the monthly value
data_cleaned["Mortgage_Per_Month"]= data_cleaned["Mortgage"]/(30*12) # Mortgage amount spread over a 30-year term (360 months) to get the monthly amount
data_cleaned.drop(["Income","Mortgage"],axis=1,inplace=True)
data_cleaned.describe(include="all").T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 4843.0 | 45.624200 | 11.343825 | 24.000000 | 36.000000 | 46.00 | 55.000000 | 67.0 |
| Experience | 4843.0 | 20.393764 | 11.338581 | 0.000000 | 11.000000 | 20.00 | 30.000000 | 43.0 |
| Family | 4843.0 | 2.401198 | 1.152326 | 1.000000 | 1.000000 | 2.00 | 3.000000 | 4.0 |
| CCAvg | 4843.0 | 1.828827 | 1.571200 | 0.000000 | 0.700000 | 1.50 | 2.500000 | 7.6 |
| Education | 4843.0 | 1.888705 | 0.838831 | 1.000000 | 1.000000 | 2.00 | 3.000000 | 3.0 |
| Personal_Loan | 4843.0 | 0.092505 | 0.289767 | 0.000000 | 0.000000 | 0.00 | 0.000000 | 1.0 |
| Securities_Account | 4843.0 | 0.104274 | 0.305647 | 0.000000 | 0.000000 | 0.00 | 0.000000 | 1.0 |
| CD_Account | 4843.0 | 0.060087 | 0.237672 | 0.000000 | 0.000000 | 0.00 | 0.000000 | 1.0 |
| CreditCard | 4843.0 | 0.294446 | 0.455840 | 0.000000 | 0.000000 | 0.00 | 1.000000 | 1.0 |
| Income_Per_Month | 4843.0 | 5.954212 | 3.632078 | 0.666667 | 3.166667 | 5.25 | 7.750000 | 16.0 |
| Mortgage_Per_Month | 4843.0 | 0.153764 | 0.273956 | 0.000000 | 0.000000 | 0.00 | 0.280556 | 1.7 |
sns.set(font_scale=1.4)
fig,ax=plt.subplots(figsize=(30,15))
sns.heatmap(data_cleaned.corr(), annot=True,cmap='coolwarm'); # High-level view of correlations among all numerical variables
sns.pairplot((data_cleaned),hue="Personal_Loan");
#Creating a new variable called "Balance left"
data_cleaned["Balance_left"]=data_cleaned["Income_Per_Month"]-data_cleaned["Mortgage_Per_Month"]-data_cleaned["CCAvg"]
data_cleaned.head()
| Age | Experience | Family | CCAvg | Education | Personal_Loan | Securities_Account | CD_Account | CreditCard | Income_Per_Month | Mortgage_Per_Month | Balance_left | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 4 | 1.6 | 1 | 0 | 1 | 0 | 0 | 4.083333 | 0.0 | 2.483333 |
| 1 | 45 | 19 | 3 | 1.5 | 1 | 0 | 1 | 0 | 0 | 2.833333 | 0.0 | 1.333333 |
| 2 | 39 | 15 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0.916667 | 0.0 | -0.083333 |
| 3 | 35 | 9 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 8.333333 | 0.0 | 5.633333 |
| 4 | 35 | 8 | 4 | 1.0 | 2 | 0 | 0 | 0 | 1 | 3.750000 | 0.0 | 2.750000 |
data_cleaned[data_cleaned["Balance_left"]<0] # Identifying clients with a negative balance after paying credit card (CCAvg) and mortgage expenses
| Age | Experience | Family | CCAvg | Education | Personal_Loan | Securities_Account | CD_Account | CreditCard | Income_Per_Month | Mortgage_Per_Month | Balance_left | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 39 | 15 | 1 | 1.00 | 1 | 0 | 0 | 0 | 0 | 0.916667 | 0.000000 | -0.083333 |
| 47 | 32 | 8 | 4 | 0.70 | 2 | 0 | 1 | 0 | 0 | 0.666667 | 0.000000 | -0.033333 |
| 105 | 41 | 14 | 3 | 1.00 | 2 | 0 | 0 | 0 | 0 | 0.750000 | 0.000000 | -0.250000 |
| 174 | 36 | 12 | 4 | 0.70 | 2 | 0 | 0 | 0 | 0 | 0.833333 | 0.225000 | -0.091667 |
| 188 | 34 | 10 | 4 | 1.00 | 1 | 0 | 1 | 0 | 0 | 1.083333 | 0.263889 | -0.180556 |
| 190 | 55 | 31 | 4 | 0.70 | 1 | 0 | 0 | 0 | 0 | 0.750000 | 0.247222 | -0.197222 |
| 282 | 43 | 16 | 3 | 0.67 | 2 | 0 | 0 | 0 | 0 | 0.666667 | 0.244444 | -0.247778 |
| 649 | 47 | 23 | 1 | 0.90 | 3 | 0 | 0 | 0 | 1 | 0.916667 | 0.286111 | -0.269444 |
| 828 | 50 | 23 | 2 | 1.00 | 2 | 0 | 0 | 0 | 0 | 1.250000 | 0.280556 | -0.030556 |
| 893 | 50 | 23 | 2 | 1.00 | 2 | 0 | 0 | 0 | 0 | 0.750000 | 0.000000 | -0.250000 |
| 959 | 59 | 35 | 4 | 0.70 | 1 | 0 | 1 | 0 | 0 | 0.666667 | 0.252778 | -0.286111 |
| 1193 | 51 | 26 | 2 | 0.70 | 3 | 0 | 0 | 0 | 1 | 1.000000 | 0.302778 | -0.002778 |
| 1579 | 59 | 34 | 3 | 1.30 | 2 | 0 | 0 | 1 | 1 | 1.500000 | 0.288889 | -0.088889 |
| 1854 | 38 | 13 | 2 | 1.40 | 2 | 0 | 0 | 0 | 1 | 1.583333 | 0.333333 | -0.150000 |
| 1914 | 41 | 17 | 1 | 1.00 | 1 | 0 | 0 | 0 | 0 | 0.916667 | 0.000000 | -0.083333 |
| 2020 | 39 | 9 | 3 | 2.00 | 3 | 0 | 1 | 0 | 0 | 2.416667 | 0.419444 | -0.002778 |
| 2062 | 59 | 35 | 2 | 1.00 | 1 | 0 | 0 | 0 | 1 | 0.916667 | 0.000000 | -0.083333 |
| 2121 | 25 | 1 | 4 | 1.00 | 1 | 0 | 0 | 0 | 1 | 1.083333 | 0.263889 | -0.180556 |
| 2176 | 63 | 37 | 1 | 0.80 | 2 | 0 | 0 | 0 | 0 | 0.666667 | 0.269444 | -0.402778 |
| 2184 | 56 | 31 | 4 | 0.90 | 2 | 0 | 0 | 1 | 1 | 1.083333 | 0.211111 | -0.027778 |
| 2329 | 57 | 32 | 4 | 0.90 | 2 | 0 | 1 | 0 | 0 | 1.083333 | 0.216667 | -0.033333 |
| 2418 | 33 | 9 | 3 | 0.90 | 3 | 0 | 0 | 0 | 0 | 1.166667 | 0.316667 | -0.050000 |
| 2420 | 53 | 27 | 4 | 2.80 | 2 | 0 | 1 | 0 | 0 | 3.166667 | 0.400000 | -0.033333 |
| 2440 | 60 | 36 | 2 | 1.00 | 1 | 0 | 0 | 0 | 1 | 0.833333 | 0.000000 | -0.166667 |
| 2452 | 59 | 35 | 2 | 1.00 | 1 | 0 | 0 | 0 | 0 | 1.166667 | 0.297222 | -0.130556 |
| 2493 | 45 | 18 | 3 | 0.67 | 2 | 0 | 0 | 0 | 0 | 0.833333 | 0.277778 | -0.114444 |
| 2510 | 31 | 7 | 4 | 0.70 | 2 | 0 | 0 | 0 | 0 | 0.666667 | 0.000000 | -0.033333 |
| 2600 | 51 | 25 | 1 | 1.40 | 3 | 0 | 0 | 0 | 0 | 1.583333 | 0.272222 | -0.088889 |
| 2641 | 62 | 37 | 1 | 1.50 | 2 | 0 | 0 | 0 | 0 | 1.500000 | 0.352778 | -0.352778 |
| 2745 | 40 | 16 | 1 | 1.00 | 1 | 0 | 1 | 0 | 0 | 1.000000 | 0.252778 | -0.252778 |
| 2908 | 63 | 37 | 1 | 0.80 | 2 | 0 | 0 | 0 | 1 | 0.916667 | 0.283333 | -0.166667 |
| 3686 | 48 | 24 | 4 | 1.00 | 1 | 0 | 0 | 0 | 0 | 1.000000 | 0.247222 | -0.247222 |
| 3919 | 53 | 26 | 2 | 1.00 | 2 | 0 | 0 | 0 | 0 | 1.166667 | 0.230556 | -0.063889 |
| 4051 | 33 | 9 | 4 | 1.00 | 1 | 0 | 0 | 0 | 1 | 0.833333 | 0.225000 | -0.391667 |
| 4077 | 50 | 23 | 1 | 0.50 | 2 | 0 | 0 | 0 | 0 | 0.750000 | 0.272222 | -0.022222 |
| 4133 | 45 | 19 | 3 | 1.50 | 1 | 0 | 0 | 0 | 1 | 1.583333 | 0.261111 | -0.177778 |
| 4172 | 32 | 8 | 3 | 0.90 | 3 | 0 | 0 | 0 | 0 | 1.166667 | 0.308333 | -0.041667 |
| 4185 | 49 | 24 | 4 | 0.80 | 1 | 0 | 0 | 0 | 0 | 1.083333 | 0.308333 | -0.025000 |
| 4279 | 63 | 38 | 4 | 0.60 | 2 | 0 | 0 | 0 | 1 | 0.750000 | 0.277778 | -0.127778 |
| 4363 | 26 | 1 | 2 | 0.90 | 3 | 0 | 0 | 0 | 1 | 0.666667 | 0.000000 | -0.233333 |
| 4367 | 41 | 17 | 1 | 1.00 | 1 | 0 | 0 | 0 | 1 | 0.750000 | 0.000000 | -0.250000 |
| 4554 | 60 | 36 | 2 | 1.00 | 1 | 0 | 0 | 0 | 0 | 0.666667 | 0.000000 | -0.333333 |
| 4745 | 52 | 26 | 1 | 1.40 | 3 | 0 | 0 | 0 | 0 | 1.583333 | 0.266667 | -0.083333 |
*43 customers have a negative balance after covering credit card and mortgage expenses
len(data_cleaned[(data_cleaned["Balance_left"]<0) & (data_cleaned["Personal_Loan"]==1)])# Counts the number of clients who have a loan and end up with no money after paying for expenses
0
data_cleaned.drop(["Balance_left"],axis=1,inplace=True) # Removing Balance left variable
x=data_cleaned.drop(["Personal_Loan"],axis=1) # The rest of the variables are features
Y=data_cleaned["Personal_Loan"] # Personal loan is the target variable
x_train,x_test,Y_train,Y_test=train_test_split(x,Y,test_size=0.3,random_state=1)# Splitting data into training and test set
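Since only about 9.6% of customers took the loan, passing `stratify` to `train_test_split` would keep the class ratio identical in both splits. A small sketch on made-up data (the toy frame and 10% positive rate are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 10 positives out of 100
X_toy = pd.DataFrame({"f": range(100)})
y_toy = pd.Series([1] * 10 + [0] * 90)
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)
# Stratification places exactly 7 positives in train and 3 in test (70/30 of 10)
```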
x_train.head()
| Age | Experience | Family | CCAvg | Education | Securities_Account | CD_Account | CreditCard | Income_Per_Month | Mortgage_Per_Month | |
|---|---|---|---|---|---|---|---|---|---|---|
| 545 | 55 | 29 | 3 | 0.80 | 1 | 0 | 0 | 0 | 6.583333 | 0.0 |
| 4389 | 41 | 17 | 4 | 2.67 | 1 | 0 | 0 | 0 | 6.916667 | 0.0 |
| 2351 | 29 | 5 | 4 | 0.40 | 2 | 0 | 0 | 0 | 2.833333 | 0.0 |
| 1071 | 28 | 2 | 3 | 0.30 | 3 | 0 | 0 | 1 | 5.833333 | 0.0 |
| 716 | 62 | 37 | 4 | 3.40 | 2 | 0 | 0 | 0 | 7.083333 | 0.0 |
print("{0:0.2f}% data is in training set".format((len(x_train)/len(data_cleaned.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(data_cleaned.index)) * 100))
70.00% data is in training set
30.00% data is in test set
print("Original Personal_Loan True Values : {0} ({1:0.2f}%)".format(len(data.loc[data['Personal_Loan'] == 1]), (len(data.loc[data['Personal_Loan'] == 1])/len(data.index)) * 100))
print("Original Personal_Loan False Values : {0} ({1:0.2f}%)".format(len(data.loc[data['Personal_Loan'] == 0]), (len(data.loc[data['Personal_Loan'] == 0])/len(data.index)) * 100))
print("")
print("Training Personal_Loan True Values : {0} ({1:0.2f}%)".format(len(Y_train[Y_train[:] == 1]), (len(Y_train[Y_train[:] == 1])/len(Y_train)) * 100))
print("Training Personal_Loan False Values : {0} ({1:0.2f}%)".format(len(Y_train[Y_train[:] == 0]), (len(Y_train[Y_train[:] == 0])/len(Y_train)) * 100))
print("")
print("Test Personal_Loan True Values : {0} ({1:0.2f}%)".format(len(Y_test[Y_test[:] == 1]), (len(Y_test[Y_test[:] == 1])/len(Y_test)) * 100))
print("Test Personal_Loan False Values : {0} ({1:0.2f}%)".format(len(Y_test[Y_test[:] == 0]), (len(Y_test[Y_test[:] == 0])/len(Y_test)) * 100))
print("")
Original Personal_Loan True Values : 480 (9.60%)
Original Personal_Loan False Values : 4520 (90.40%)

Training Personal_Loan True Values : 319 (9.41%)
Training Personal_Loan False Values : 3071 (90.59%)

Test Personal_Loan True Values : 129 (8.88%)
Test Personal_Loan False Values : 1324 (91.12%)
model=LogisticRegression(solver="liblinear")
model.fit(x_train,Y_train)
LogisticRegression(solver='liblinear')
y_predict_train=model.predict(x_train)
coef_df=pd.DataFrame(model.coef_)
coef_df["Intercept"] = model.intercept_
print(coef_df.T)
                  0
0         -0.444026
1          0.445703
2          0.588667
3          0.179231
4          1.656337
5         -0.764237
6          3.080269
7         -0.883203
8          0.681039
9          0.124122
Intercept -2.337203
def Confusion_Matrix(yTest, ypredict):
    cm = metrics.confusion_matrix(yTest, ypredict, labels=[1,0])
    df_cm = pd.DataFrame(cm, index=["1","0"],
                         columns=["Predict 1","Predict 0"])
    plt.figure(figsize=(7,5))
    sns.heatmap(df_cm, annot=True, fmt="d")
Confusion_Matrix(Y_train,y_predict_train)
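For reference, recall and precision can be read directly off the confusion matrix cells. A sketch on toy labels (the values below are illustrative, not from this dataset):

```python
import numpy as np
from sklearn import metrics

# Toy labels/predictions: 3 actual positives, 2 of them caught, 1 false alarm
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

# With labels=[1,0] the top-left cell is the true positives, matching the heatmap above
cm = metrics.confusion_matrix(y_true, y_pred, labels=[1, 0])
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP)
```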
y_predict_test=model.predict(x_test)
Confusion_Matrix(Y_test,y_predict_test)
model_score=model.score(x_test,Y_test)
print(model_score)
0.9545767377838954
print('Accuracy on train data:',accuracy_score(Y_train, y_predict_train) )
print('Accuracy on test data:',accuracy_score(Y_test, y_predict_test))
Accuracy on train data: 0.9533923303834808
Accuracy on test data: 0.9545767377838954
#AUC ROC curve
model_roc_auc = roc_auc_score(Y_test, model.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(Y_test, model.predict_proba(x_test)[:,1])
plt.figure(figsize=(15,10))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % model_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
# The optimal cut-off is where tpr (true positive rate) is high and fpr (false positive rate) is low.
# This is done on train data
fpr, tpr, thresholds = roc_curve(Y_train, model.predict_proba(x_train)[:,1])
optimal_idx = np.argmax(tpr - fpr) # Calculates the maximum distance between the tpr and fpr
optimal_threshold = thresholds[optimal_idx]
print ('Optimal Threshold:',(optimal_threshold))
Optimal Threshold: 0.12310110478623998
# If probability is greater than the optimal threshold defined above return 1 else return 0
y_pred_tr = (model.predict_proba(x_train)[:,1]>optimal_threshold).astype(int)
y_pred_ts = (model.predict_proba(x_test)[:,1]>optimal_threshold).astype(int)
Confusion_Matrix(Y_test,y_pred_ts)
print('Accuracy on train data:',accuracy_score(Y_train, y_pred_tr) )
print('Accuracy on test data:',accuracy_score(Y_test, y_pred_ts))
Accuracy on train data: 0.9091445427728614
Accuracy on test data: 0.9153475567790778
# Dataframe with numerical columns only
num_feature_set = x.copy()
num_feature_set = add_constant(num_feature_set)
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
Series before feature selection:

const                 451.718498
Age                    91.645484
Experience             91.538686
Family                  1.028579
CCAvg                   1.599822
Education               1.102838
Securities_Account      1.134855
CD_Account              1.285075
CreditCard              1.105546
Income_Per_Month        1.717624
Mortgage_Per_Month      1.047007
dtype: float64
variables_with_perfect_collinearity = vif_series1[vif_series1.values==np.inf].index.tolist() # Variables with perfect collinearity would show an infinite VIF
variables_with_perfect_collinearity
[]
X_train, X_test, y_train, y_test = train_test_split(num_feature_set, Y, test_size=0.30)
reg = sm.Logit(y_train, X_train)
lg = reg.fit()
Optimization terminated successfully.
Current function value: 0.123533
Iterations 9
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3390
Model: Logit Df Residuals: 3379
Method: MLE Df Model: 10
Date: Sun, 09 May 2021 Pseudo R-squ.: 0.6056
Time: 06:28:38 Log-Likelihood: -418.78
converged: True LL-Null: -1061.9
Covariance Type: nonrobust LLR p-value: 3.412e-270
======================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------
const -14.2667 2.014 -7.082 0.000 -18.215 -10.319
Age -0.0191 0.074 -0.259 0.795 -0.164 0.125
Experience 0.0274 0.073 0.374 0.708 -0.116 0.171
Family 0.7304 0.091 8.018 0.000 0.552 0.909
CCAvg 0.2076 0.052 3.983 0.000 0.105 0.310
Education 1.7425 0.144 12.088 0.000 1.460 2.025
Securities_Account -0.8734 0.372 -2.346 0.019 -1.603 -0.144
CD_Account 3.3617 0.395 8.507 0.000 2.587 4.136
CreditCard -0.9400 0.243 -3.874 0.000 -1.416 -0.464
Income_Per_Month 0.7370 0.043 17.228 0.000 0.653 0.821
Mortgage_Per_Month 0.1176 0.259 0.455 0.649 -0.390 0.625
======================================================================================
Calculate the odds ratio from the coefficients using odds ratio = exp(coef), then calculate the probability from the odds ratio using probability = odds / (1 + odds).
#Calculate Odds Ratio, probability
##create a data frame to collate Odds ratio, probability and p-value of the coef
lgcoef = pd.DataFrame(lg.params, columns=['coef'])
lgcoef.loc[:, "Odds_ratio"] = np.exp(lgcoef.coef)
lgcoef['probability'] = lgcoef['Odds_ratio']/(1+lgcoef['Odds_ratio'])
lgcoef['pval']=lg.pvalues
pd.options.display.float_format = '{:.2f}'.format
# Filter by significant p-value (pval <= 0.005) and sort descending by odds ratio
lgcoef = lgcoef.sort_values(by="Odds_ratio", ascending=False)
pval_filter = lgcoef['pval']<=0.005 # Keeps variables whose p-value is at most 0.005
lgcoef[pval_filter]
| coef | Odds_ratio | probability | pval | |
|---|---|---|---|---|
| CD_Account | 3.36 | 28.84 | 0.97 | 0.00 |
| Education | 1.74 | 5.71 | 0.85 | 0.00 |
| Income_Per_Month | 0.74 | 2.09 | 0.68 | 0.00 |
| Family | 0.73 | 2.08 | 0.67 | 0.00 |
| CCAvg | 0.21 | 1.23 | 0.55 | 0.00 |
| CreditCard | -0.94 | 0.39 | 0.28 | 0.00 |
| const | -14.27 | 0.00 | 0.00 | 0.00 |
# We are looking for the overall significant variables
pval_filter = lgcoef['pval']<=0.0001
imp_vars = lgcoef[pval_filter].index.tolist()
# We are going to get the overall variables (un-one-hot-encoded variables) from the categorical variables
sig_var = []
for col in imp_vars:
    if '_' in col:
        first_part = col.split('_')[0]
        for c in data.columns:
            if first_part in c and c not in sig_var:
                sig_var.append(c)
start = '\033[1m' # ANSI code for bold text
end = '\033[0m' # ANSI code to reset formatting
print('Most significant variables category wise are :\n',lgcoef[pval_filter].index.tolist())
print('*'*120)
print(start+'Most overall significant variables are '+end,':\n',sig_var)
Most significant variables category wise are :
 ['CD_Account', 'Education', 'Income_Per_Month', 'Family', 'CCAvg', 'const']
************************************************************************************************************************
Most overall significant variables are  :
 ['CD_Account', 'Income']
pred_train = lg.predict(X_train)
pred_train = np.round(pred_train)
print("Confusion Matrix = \n")
Confusion_Matrix(y_train,pred_train)
Confusion Matrix =
pred_ts = lg.predict(X_test)
pred_ts = np.round(pred_ts)
print("Confusion Matrix = \n")
Confusion_Matrix(y_test,pred_ts )
Confusion Matrix =
print('Accuracy on train data:',accuracy_score(y_train, pred_train) )
print('Accuracy on test data:',accuracy_score(y_test, pred_ts))
Accuracy on train data: 0.952212389380531
Accuracy on test data: 0.9621472814865795
lg_auc = roc_auc_score(y_test, lg.predict(X_test)) # AUC for the statsmodels logit model
fpr, tpr, thresholds = roc_curve(y_test, lg.predict(X_test))
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lg_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
pred_train = lg.predict(X_train)
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, pred_train)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print('Optimal Threshold:',(optimal_threshold))
Optimal Threshold: 0.10170864235068788
y_pred_tr = (lg.predict(X_train)>optimal_threshold).astype(int)
y_pred_ts = (lg.predict(X_test)>optimal_threshold).astype(int)
Confusion_Matrix(y_train,y_pred_tr )
Confusion_Matrix(y_test,y_pred_ts)
#Accuracy with optimal threshold
print('Accuracy on train data:',accuracy_score(y_train,y_pred_tr) )
print('Accuracy on test data:',accuracy_score(y_test,y_pred_ts))
Accuracy on train data: 0.8988200589970502
Accuracy on test data: 0.9002064693737095
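`Confusion_Matrix` is a helper defined earlier in the notebook; the headline metrics can be read straight off its four cells. A stdlib-only sketch, assuming the usual `[[TN, FP], [FN, TP]]` layout and toy counts:

```python
def metrics_from_cm(tn, fp, fn, tp):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual loan-takers caught
    precision = tp / (tp + fp)     # share of flagged customers who convert
    return accuracy, recall, precision

# Toy counts: lowering the threshold trades a little accuracy for recall
print(metrics_from_cm(tn=900, fp=50, fn=10, tp=40))  # (0.94, 0.8, 0.444...)
```

This is why the thresholded model above looks "worse" on accuracy than the 0.5-cutoff model: it accepts more false positives (cheap marketing contacts) to miss fewer true loan prospects.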
xtrain, xtest, ytrain, ytest = train_test_split(x, Y, test_size=0.3 , random_state=1)
print(xtrain.shape)
print(xtest.shape)
print(ytrain.shape)
print(ytest.shape)
(3390, 10)
(1453, 10)
(3390,)
(1453,)
clf = DecisionTreeClassifier(criterion = 'gini', random_state=1)
clf.fit(xtrain,ytrain)
DecisionTreeClassifier(random_state=1)
# accuracy on training set
print("Accuracy on train set", clf.score(xtrain,ytrain))
# accuracy on test set
print("Accuracy on test set", clf.score(xtest,ytest))
Accuracy on train set 1.0
Accuracy on test set 0.9848589125946318
y_predict_train = clf.predict(xtrain)
y_predict_test = clf.predict(xtest)
Confusion_Matrix(ytrain,y_predict_train)
Confusion_Matrix(ytest,y_predict_test)
print("Recall on training set : ", metrics.recall_score(ytrain,y_predict_train))
print("Recall on test set : ", metrics.recall_score(ytest,y_predict_test))
Recall on training set :  1.0
Recall on test set :  0.8914728682170543
column_names = list(x.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Family', 'CCAvg', 'Education', 'Securities_Account', 'CD_Account', 'CreditCard', 'Income_Per_Month', 'Mortgage_Per_Month']
plt.figure(figsize=(20,30))
out = tree.plot_tree(clf,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(clf,feature_names=feature_names,show_weights=True))
|--- Income_Per_Month <= 9.46
|   |--- CCAvg <= 2.95
|   |   |--- Income_Per_Month <= 8.88
|   |   |   |--- weights: [2498.00, 0.00] class: 0
|   |   |--- Income_Per_Month > 8.88
|   |   |   |--- truncated branches on CCAvg, CreditCard, CD_Account, Family, Age, Education, Mortgage_Per_Month, Experience
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- truncated branches on Income_Per_Month, Experience, CCAvg, Mortgage_Per_Month, Family, Education, Age
|   |   |--- CD_Account > 0.50
|   |   |   |--- truncated branches on CCAvg, Age, Income_Per_Month (mostly class 1)
|--- Income_Per_Month > 9.46
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [336.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 50.00] class: 1
|   |--- Education > 1.50
|   |   |--- Income_Per_Month <= 9.71
|   |   |   |--- truncated branches on CCAvg, Experience, Family, Age
|   |   |--- Income_Per_Month > 9.71
|   |   |   |--- weights: [0.00, 205.00] class: 1
importances = clf.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
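The bar chart sorts features ascending by impurity-based importance via `np.argsort`, so the most important feature lands at the top of the horizontal plot. The same ranking logic in plain Python, with hypothetical importance values:

```python
importances = [0.05, 0.60, 0.10, 0.25]          # hypothetical, sum to 1
names = ['Age', 'Income_Per_Month', 'CCAvg', 'Education']

# argsort equivalent: indices ordered by ascending importance
order = sorted(range(len(importances)), key=importances.__getitem__)
ranked = [(names[i], importances[i]) for i in order]
print(ranked)  # [('Age', 0.05), ('CCAvg', 0.1), ('Education', 0.25), ('Income_Per_Month', 0.6)]
```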
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1, 10),
              'min_samples_leaf': [1, 2, 5, 7, 10, 15, 20],
              'max_leaf_nodes': [5, 10, 15, 20, 25, 30],
              'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1]}
# Type of scoring used to compare parameter combinations (recall, not accuracy)
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(xtrain, ytrain)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(xtrain, ytrain)
DecisionTreeClassifier(max_depth=7, max_leaf_nodes=15,
min_impurity_decrease=0.0001, min_samples_leaf=7,
random_state=1)
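`GridSearchCV` evaluates every combination in the grid by 5-fold cross-validated recall. The enumeration itself is a Cartesian product over the parameter lists; a stdlib-only sketch with a toy scoring function standing in for the cross-validated model fit:

```python
from itertools import product

parameters = {'max_depth': [3, 5, 7],
              'min_samples_leaf': [1, 5]}

def toy_score(params):
    # Stand-in for cross-validated recall; a real search fits a tree here
    return -abs(params['max_depth'] - 7) - 0.1 * params['min_samples_leaf']

keys = list(parameters)
best = max((dict(zip(keys, combo)) for combo in product(*parameters.values())),
           key=toy_score)
print(best)  # {'max_depth': 7, 'min_samples_leaf': 1}
```

The real grid above has 9 × 7 × 6 × 4 = 1512 combinations, each fitted 5 times, which is why exhaustive search is usually kept to small, coarse grids.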
pred_train = estimator.predict(xtrain)
pred_test = estimator.predict(xtest)
Confusion_Matrix(ytest, pred_test)
print("Recall on training set: ",metrics.recall_score(ytrain,pred_train))
print("Recall on test set: ",metrics.recall_score(ytest,pred_test))
Recall on training set:  1.0
Recall on test set:  0.8914728682170543
column_names = list(x.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Family', 'CCAvg', 'Education', 'Securities_Account', 'CD_Account', 'CreditCard', 'Income_Per_Month', 'Mortgage_Per_Month']
plt.figure(figsize=(10,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
ccp = DecisionTreeClassifier(random_state=1)
path = ccp.cost_complexity_pruning_path(xtrain, ytrain)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
|   | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00 | 0.00 |
| 1 | 0.00 | 0.00 |
| 2 | 0.00 | 0.00 |
| 3 | 0.00 | 0.00 |
| 4 | 0.00 | 0.00 |
| 5 | 0.00 | 0.00 |
| 6 | 0.00 | 0.00 |
| 7 | 0.00 | 0.01 |
| 8 | 0.00 | 0.01 |
| 9 | 0.00 | 0.01 |
| 10 | 0.00 | 0.01 |
| 11 | 0.00 | 0.01 |
| 12 | 0.00 | 0.01 |
| 13 | 0.00 | 0.01 |
| 14 | 0.00 | 0.01 |
| 15 | 0.00 | 0.02 |
| 16 | 0.00 | 0.02 |
| 17 | 0.00 | 0.02 |
| 18 | 0.00 | 0.02 |
| 19 | 0.00 | 0.02 |
| 20 | 0.00 | 0.02 |
| 21 | 0.00 | 0.03 |
| 22 | 0.00 | 0.03 |
| 23 | 0.01 | 0.04 |
| 24 | 0.03 | 0.06 |
| 25 | 0.05 | 0.17 |
fig, ax = plt.subplots(figsize=(15,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(xtrain, ytrain)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.05345396264193413
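Cost-complexity pruning trades total leaf impurity R(T) against tree size by minimizing R(T) + alpha × |leaves|. Each internal node has an "effective alpha": the impurity its subtree saves, divided by the extra leaves the subtree adds; pruning collapses the node with the smallest effective alpha first (the weakest link). A stdlib-only sketch on hypothetical impurity numbers (not this tree's):

```python
def effective_alpha(node_impurity, subtree_impurity, n_subtree_leaves):
    # alpha at which collapsing this subtree to a single leaf breaks even
    return (node_impurity - subtree_impurity) / (n_subtree_leaves - 1)

# Hypothetical subtrees: (impurity if collapsed, impurity kept, leaf count)
candidates = [(0.20, 0.05, 4), (0.10, 0.08, 2), (0.30, 0.10, 5)]
alphas = sorted(effective_alpha(*c) for c in candidates)
print(alphas)  # ascending order = the sequence in which subtrees get pruned
```

This is exactly the sequence `cost_complexity_pruning_path` returns: as `ccp_alpha` rises past each effective alpha, one more subtree collapses, until only the root remains.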
# Drop the last element of clfs and ccp_alphas (the trivial one-node tree); the charts show that less depth means less overfitting
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
train_scores = [clf.score(xtrain, ytrain) for clf in clfs]
test_scores = [clf.score(xtest, ytest) for clf in clfs]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
When ccp_alpha is set to zero, keeping the other default parameters of DecisionTreeClassifier, the tree overfits: training accuracy reaches 100% while test accuracy is lower. As alpha increases, more of the tree is pruned, producing a decision tree that generalizes better. The accuracy-vs-alpha plot identifies the value that maximizes testing accuracy.
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(xtrain)
    recall_train.append(metrics.recall_score(ytrain, pred_train3))
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(xtest)
    recall_test.append(metrics.recall_score(ytest, pred_test3))
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)
pred_train = best_model.predict(xtrain)
pred_test = best_model.predict(xtest)
print("Recall on training set : ",metrics.recall_score(ytrain,pred_train))
print("Recall on test set : ",metrics.recall_score(ytest,pred_test))
Recall on training set : 1.0 Recall on test set : 0.8914728682170543
best_model.fit(xtrain, ytrain)
DecisionTreeClassifier(random_state=1)
y_predict = best_model.predict(xtest)
Confusion_Matrix(ytest,y_predict)
plt.figure(figsize=(10,10))
out = tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model,feature_names=feature_names,show_weights=True))
|--- Income_Per_Month <= 9.46
|   |--- CCAvg <= 2.95
|   |   |--- Income_Per_Month <= 8.88
|   |   |   |--- weights: [2498.00, 0.00] class: 0
|   |   |--- Income_Per_Month > 8.88
|   |   |   |--- truncated branches on CCAvg, CreditCard, CD_Account, Family, Age, Education, Mortgage_Per_Month, Experience
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- truncated branches on Income_Per_Month, Experience, CCAvg, Mortgage_Per_Month, Family, Education, Age
|   |   |--- CD_Account > 0.50
|   |   |   |--- truncated branches on CCAvg, Age, Income_Per_Month (mostly class 1)
|--- Income_Per_Month > 9.46
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [336.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 50.00] class: 1
|   |--- Education > 1.50
|   |   |--- Income_Per_Month <= 9.71
|   |   |   |--- truncated branches on CCAvg, Experience, Family, Age
|   |   |--- Income_Per_Month > 9.71
|   |   |   |--- weights: [0.00, 205.00] class: 1
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with hyperparameter tuning',
'Decision tree with post-pruning'], 'Train_Recall':[1.0,0.91,1.0], 'Test_Recall':[0.87,0.86,0.89]})
comparison_frame
|   | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.87 |
| 1 | Decision tree with hyperparameter tuning | 0.91 | 0.86 |
| 2 | Decision tree with post-pruning | 1.00 | 0.89 |
Best prediction: the post-pruned decision tree, with the highest test recall (0.89) in the comparison above.
Most Significant variables: Income_Per_Month, Education, Family, CCAvg and CD_Account dominate both the logistic regression significance tests and the tree splits.
Segments to be targeted: high-income customers, especially those with Education above level 1 or families of 3 or more; customers with high average credit-card spend (CCAvg); and CD account holders.